- Data Compression Techniques
- Using SQ and USQ
-
- Mark DiVecchio
- San Diego IBM PC Users Group
-
- In any computer system, efficient management of the file storage
- space is important. The two programs SQ and USQ reduce the size of
- data files by using the Huffman Data Compression Algorithm.
-
- A file is composed of a sequence of bytes. A byte is composed of 8
- bits. That byte can have a decimal value between 0 and 255. A
- typical text file, like a C language program, is composed of the
- ASCII character set (decimal 0 to 127). This character set only
- requires seven of the eight bits in a byte. Just think -- if there
- were a way to use that extra bit for another character, you could
- reduce the size of the file by 12.5%. For those of you who use
- upper case characters only, you use only about 100 of the codes
- from 0 to 255, so much of each byte's coding capacity goes unused.
- If the bits not needed within each byte could be used for other
- characters, you could save even more of your storage.
- What if you could encode the most frequently used characters in the
- file with only one bit? For example, if you could store the letter
- "e" (the most commonly used letter in the English language) using
- only one bit : "1", you could save 87.5% of the space that the "e"
- would normally take if it were stored as the ASCII "0110 0101"
- binary.
-
- The answer to these questions is the SQ and USQ programs.
-
- The SQ program uses the Huffman coding technique: it counts the
- frequency of use of each of the 256 possible byte patterns, and it
- then assigns each character a translation to a bit string. All
- of these bit strings are placed end to end and written onto a disk
- file. The encoding information is also put on the file since the
- USQ program needs to know the character distribution of the original
- file.
-
- The USQ program reads in the encoding information and then reads in
- the encoded file. It is a simple matter to scan the encoded file
- and produce an output file which is identical to the file that SQ
- started with.
-
- Huffman Coding Technique
-
- This is by far the most popular encoding technique in use today.
- The Huffman encoding replaces fixed length characters with
- variable length bit strings. The length of a bit string is roughly
- proportional to the negative logarithm of the character's
- probability of occurrence: frequent characters get short strings.
- For those of you inclined to such symbolism:
-
-      Length of bit string  ~=  -log2(character probability)
-
- The implementation of the algorithm which we will discuss encodes
- fixed length bit strings of length 8 (one byte at a time).
-
- This algorithm requires two passes through the input file. The
- first pass builds a table of 256 entries showing the frequency of
- occurrence of each of the 256 possible values of a byte of
- information.
-
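- The counting pass can be sketched in a few lines. This is a
- hypothetical illustration in Python, not SQ's actual code (SQ
- itself is a C program):

```python
def count_frequencies(data: bytes) -> list:
    """First pass: tally how often each of the 256 possible byte
    values occurs in the input."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts

# 'A' (decimal 65) occurs five times in ABRACADABRA.
counts = count_frequencies(b"ABRACADABRA")
```

-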
- Once the counting is complete, the algorithm must decide which bit
- strings to associate with each of the 256 characters that were found
- in the file. Note that if a particular byte value was never used,
- no string association is needed.
-
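- One standard way to assign the bit strings is to repeatedly merge
- the two least frequent entries into a tree, as in this sketch
- (hypothetical Python using the standard heapq module; SQ's own C
- implementation differs in detail):

```python
import heapq

def huffman_codes(counts):
    """Build a Huffman tree from the byte counts and return a table
    mapping each used byte value to its bit string."""
    # Heap entries are (count, tie_breaker, tree); a tree is either a
    # byte value (a leaf) or a (subtree, subtree) pair.
    heap = [(c, i, i) for i, c in enumerate(counts) if c > 0]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                    # file with one distinct byte
        return {heap[0][2]: "0"}
    tick = 256                            # unique tie-breakers
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)   # two least frequent trees
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tick, (t1, t2)))
        tick += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, int):         # leaf: record its code
            codes[tree] = prefix
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2], "")
    return codes
```

- Bytes that never occur get no entry at all, just as described
- above, and no code is ever a prefix of another.
-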
- The second pass through the input file converts each byte into its
- encoded string. Remember that when the output file is created, the
- information used for encoding must also be written on the file for
- use by the decoding program.
-
- The decoding program reads the encoding information from the file
- and then starts reading the bit strings. As soon as enough bits are
- read to interpret a character, that character is written onto the
- final output file. See the next two sections on how SQ and USQ
- actually implement this.
-
- Even though this article has primarily addressed ASCII input
- files, there is nothing which restricts this algorithm to ASCII.
- It will work on binary files (.COM or .EXE) as well. But since a
- byte's encoded bit string only gets short when that byte occurs
- much more often than the others, a binary file may not compress
- very much. This is because a binary file most likely has a fairly
- uniform distribution over the 256 values in a byte. A machine
- language program is not like the English language where the letter
- "e" is used far more than other letters. If the distribution is
- uniform, the encoded bit strings will all be the same length and the
- encoded file could be longer than the original (because of the
- encoding information on the front of the file). All of this has to
- be qualified, because machine language programs tend to use a lot of
- "MOV" instructions and have a lot of bytes of zeros so that encoding
- .COM and .EXE files does save some disk space.
-
- SQ
-
- The SQ program is an implementation of the Huffman algorithm.
-
- The first thing that SQ does is read through the input file and
- create a distribution array for the 256 possible characters. This
- array contains counts of the number of occurrences of each of the
- 256 characters. The program keeps each count in a 16 bit number.
- It makes sure that, if you are encoding a big file, no count
- overflows its 16 bit value. This does not happen often, but it
- must be accounted for.
-
- At the same time, SQ removes strings of identical characters and
- replaces them with the ASCII character DLE followed by a character
- count of 2-255. SQ replaces the ASCII DLE with the pair of
- characters: DLE DLE. This is not related to the Huffman algorithm
- but just serves to compress the file a little more.
-
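- That pre-pass might look like the following sketch. This is a
- hypothetical illustration, assuming a run is written as the byte,
- then DLE (ASCII 16), then the count; the exact layout and the
- minimum run length that SQ uses may differ:

```python
DLE = 0x10  # ASCII Data Link Escape

def dle_encode(data: bytes) -> bytes:
    """Replace runs of an identical byte with byte + DLE + count,
    and escape a literal DLE as the pair DLE DLE."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == b and run < 255:
            run += 1
        if b == DLE:
            out += bytes([DLE, DLE]) * run   # escape literal DLEs
        elif run >= 3:   # shorter runs would not save any space
            out += bytes([b, DLE, run])
        else:
            out += bytes([b]) * run
        i += run
    return bytes(out)
```

-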
- Once SQ has scanned the input file, it creates a binary tree
- structure containing this frequency information. The most
- frequently occurring characters have the shortest path from the
- root to their leaf, and the least frequently occurring characters
- have the longest. For example, if your file were:
-
- ABRACADABRA (a very simple and
- magical example)
-
- The table of frequency of occurrences would be:
-
- Letter # of Occurrences
- ------ ---------------
- A 5
- B 2
- C 1
- D 1
- R 2
- all the rest 0
-
- Since the letter "A" occurs most often, it should have the shortest
- encoded bit string. The letters "C" and "D" should have the
- longest. The other characters which don't appear in the input file
- don't need to be considered.
-
- SQ would create a binary tree to represent this information. The
- tree might look something like this (for purposes of discussion
- only):
-
-                  root              <--- Computer trees are
-                  /  \                   always upside down!
-                 1    0             <--- This is called a node.
-                /    / \            <--- These are called branches.
-               A    1   0
-                   /   / \
-                  B   1   0
-                     /   / \
-                    C   1   0
-                        /   \
-                       D     R     <--- This is a leaf.
-
- From this, the encoded bit strings, which are kept in a
- translation table, would be:
-
- Table Entry Character Binary
- ----------- --------- ------
- 1 A 1
- 2 B 01
- 3 C 001
- 4 D 0001
- 5 R 0000
-
-
- The output file would be:
-
-
- A B R A C A D A B R A
- ----------------------------------
- 1 01 0000 1 001 1 0001 1 01 0000 1
- (binary)
-
- A1 31 A1
- (hex)
-
-
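- Packing those bit strings into whole bytes can be sketched like
- this (hypothetical Python, using the example translation table
- above):

```python
CODES = {"A": "1", "B": "01", "C": "001", "D": "0001", "R": "0000"}

def pack(text: str) -> bytes:
    """Concatenate the code strings, then pack them into bytes,
    most significant bit first."""
    bits = "".join(CODES[ch] for ch in text)
    bits += "0" * (-len(bits) % 8)   # pad the last byte with zeros
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

pack("ABRACADABRA")  # -> the three bytes A1 31 A1
```

-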
- We have reduced the size of your file from eleven bytes to three
- bytes, a savings of about 73%. For this simple example things are
- not really that good, since we must put the binary tree encoding
- information onto the file as well, so the file actually grew. But
- consider a file with the word ABRACADABRA repeated 100,000 times.
- Now the encoding information is a very small percentage of the
- output file, and the file will shrink tremendously.
-
- SQ opens the output file and writes out the binary tree information.
- Then SQ rewinds the input file and rereads it from the beginning.
- As it reads each character, it looks into the translation table and
- outputs the corresponding bit string.
-
- SQ is a little more complicated than what I have outlined since it
- must operate in the real world of hardware, but this is a fairly
- complete description of the algorithm.
-
- USQ
-
- The USQ program is very straightforward. It reads in the encoding
- information written out by SQ and builds the identical binary tree
- that SQ used to encode the file.
-
- USQ then reads the input file as if it were a string of bits.
- Starting at the root of the tree, it traverses one branch of the
- tree with each input bit. When it reaches a leaf, it has a
- character, and that character is written to the output file. USQ
- then starts at the root again with the next bit from the input
- file.
-
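- The bit-by-bit walk can be sketched against the example tree from
- the SQ section (hypothetical Python; leaves are characters and
- each interior node is a pair indexed by the next bit):

```python
# The example tree: node[0] is taken on a 0 bit, node[1] on a 1 bit.
TREE = (((("R", "D"), "C"), "B"), "A")

def decode(bits: str) -> str:
    """Walk the tree one bit at a time; emit a character at each
    leaf, then start over at the root."""
    out, node = [], TREE
    for bit in bits:
        node = node[int(bit)]
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = TREE
    return "".join(out)

decode("101000010011000110100001")  # -> "ABRACADABRA"
```

-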
- What does it all mean?
-
- Now that we understand the algorithm and a little about how the SQ
- and USQ programs work, we can use that knowledge to help us run our
- systems a little more efficiently. (So all of this theory is worth
- something after all!).
-
- 1. Files must be above a threshold size, or else the output file
- will be longer than the input file because of the decoding
- information put at the beginning of the compressed data. We don't
- know the exact size of the threshold because the encoding binary
- tree information depends on the distribution of the characters in a
- file. At least we know to check the size of the encoded file after
- we run SQ to make sure our file didn't grow.
-
- 2. Some files will not compress well if they have a uniform
- distribution of byte values, for example, .COM or .EXE files. This
- is because of the way SQ builds the tree. Remember that bytes with
- the same frequency of occurrence are at the same depth (usually) in
- the tree. So if all of the bytes have the same depth, the output
- strings are all the same length.
-
- 3. SQ reads the input file twice. If you can, use a RAM disk at
- least for the input file, and for both files if you have the room.
- The next best case is to use two floppy drives, one for input and
- one for output. This will cause a lot of disk starts and stops but
- not much head movement. Worst case is to use one floppy drive for
- both input and output. This will cause a lot of head movement as
- the programs alternate between the input and output files.
-
- Other Compression Techniques
-
- RUN-LENGTH ENCODING
-
- Run-length encoding is a technique whereby sequences of identical
- bytes are replaced by the repeated byte and a byte count. As you
- might guess, this method is effective only on very specialized
- files. One good candidate is a screen display buffer. A screen is
- made up mostly of "spaces". A completely blank line could be
- reduced from 80 bytes of spaces to one space followed by a value of
- 80. To go from 80 bytes down to two bytes is a savings of almost
- 98%. You might guess that for text files or binary files, this
- technique does not work well at all.
-
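- A minimal run-length encoder might look like this (a hypothetical
- sketch; practical schemes usually only encode runs above some
- minimum length):

```python
def rle_encode(data: bytes) -> bytes:
    """Replace each run with the byte followed by its count (1-255)."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([data[i], run])
        i += run
    return bytes(out)

rle_encode(b" " * 80)  # a blank 80-column line -> two bytes: space, 80
```

-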
- ADAPTIVE COMPRESSION
-
- This technique replaces strings of characters with codes. For
- example, the string "ABRACADABRA" would be replaced by a single
- code. Typical algorithms use a 12 bit code. The algorithm is
- notable in that it requires only a single pass through the input
- file as the encoding is taking place. The current incarnation of
- this procedure is called the LZW method (after its co-inventors
- A. Lempel, J. Ziv and T. Welch). This algorithm claims a savings
- of 66% on machine language files and up to 83% on COBOL files.
-
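- The core of LZW can be sketched in a few lines (a hypothetical
- single-pass illustration; real implementations cap the table at
- 12 bit codes, i.e. 4096 entries, and pack the codes into bits):

```python
def lzw_compress(data: bytes) -> list:
    """LZW: grow a dictionary of byte strings on the fly, emitting
    one code for each longest already-known string."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    out, current = [], b""
    for b in data:
        candidate = current + bytes([b])
        if candidate in table:
            current = candidate            # keep extending the match
        else:
            out.append(table[current])     # emit the known string
            table[candidate] = next_code   # learn the new string
            next_code += 1
            current = bytes([b])
    if current:
        out.append(table[current])
    return out
```

- On "ABRACADABRA" this emits nine codes for eleven bytes; the
- savings grow quickly as the dictionary fills with longer strings.
-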
- Other Reading
-
- If you are interested in reading more about data compression
- techniques, you may be interested in these articles:
-
-
- H.K. Reghbati, "An Overview of Data Compression Techniques,"
- Computer Magazine, Vol. 14, No. 4, April 1981, pp. 71-76.
-
- T.A. Welch, "A Technique for High-Performance Data Compression,"
- Computer Magazine, Vol. 17, No. 6, June 1984, pp. 8-19.
-
- J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data
- Compression," IEEE Transactions on Information Theory, Vol. IT-23,
- No. 3, May 1977, pp. 337-343.
-